Abstract. We consider the problem of finding repetitive structures and
inherent patterns in a given string $s$ of length $n$ over a finite totally
ordered alphabet. A border $u$ of a string $s$ is both a prefix and a suffix
of $s$ such that $u \neq s$. The computation of the border array of a string $s$,
namely the borders of each prefix of $s$, is strongly related to the string
matching problem: given a string $w$, find all of its occurrences in $s$.

A Lyndon word is a primitive word (i.e., it is not a power of another
word) which is minimal for the lexicographical order of its conjugacy
class (i.e., the set of words obtained by cyclic rotations of the letters).
In this paper we combine these concepts to introduce the Lyndon Border
Array $\mathcal{L}\beta$ of $s$, whose $i$-th entry $\mathcal{L}\beta(s)[i]$ is the length of the longest
border of $s[1..i]$ which is also a Lyndon word. We propose linear-time
and linear-space algorithms¹ for computing $\mathcal{L}\beta(s)$. Further, we introduce
the Lyndon Suffix Array, and by modifying the efficient suffix array
technique of Ko and Aluru [KA03] outline a linear time and space algorithm
for its construction.


1 Introduction 

Understanding complex patterns and repetitive structures in strings is essential
for efficiently solving many problems in stringology [CR02]. For instance, Lyndon
words are increasingly a fundamental and applicable form in the study of
combinatorics on words [Lot83], [Lot05], [Smy03] - these patterned words have
deep links with algebra and are rich in structural properties. Another important
concept is a border $u$ of a string $s$, defined to be both a prefix and a suffix of $s$
such that $u \neq s$. The computation of the border array of a string $s$, that is of
the borders of each prefix of $s$, is strongly related to the string matching problem:
given a string $w$, find all of its occurrences in a string $s$. It constitutes the
"failure function" of the Morris-Pratt (1970) string matching algorithm [MP70].

Lyndon words were introduced under the name of standard lexicographic
sequences [Lyn54,Lyn55] in order to construct a basis of a free Abelian group. Two
strings are conjugate if they differ only by a cyclic permutation of their
characters; a Lyndon word is defined as a (generally) finite word which is strictly
minimal for the lexicographic order of its conjugacy class. For a non-letter
Lyndon word $w$, the pair $(u, v)$ of Lyndon words such that $w = uv$ with $v$ of
maximal length is called the standard factorization of $w$.

¹ The algorithms presented in this paper can be applied to computing the co-Lyndon
Border Array by observing that a word $u = u_1 u_2 \cdots u_n$ is a co-Lyndon word if the
reversed word $\tilde{u} = u_n u_{n-1} \cdots u_1$ is a Lyndon word.

The set of Lyndon words permits the unique maximal factorization of any
given string [CFL58,Lot83]. In 1983, Duval [Duv83] developed an algorithm for
this factorization that runs in linear time and space - the algorithm
iterates over a string trying to find the longest Lyndon word; when it finds one,
it adds it to the result list and proceeds to search in the remaining part of the
string.
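The iteration just described can be sketched in Python as follows. This is a minimal illustration of Duval's algorithm in its commonly stated form, not code from [Duv83]; the function name `lyndon_factorization` is an assumption made here for illustration.

```python
def lyndon_factorization(s):
    """Return the Lyndon factors of s, left to right (Duval's algorithm)."""
    factors = []
    n, i = len(s), 0
    while i < n:
        j, k = i + 1, i
        # Extend the longest prefix of s[i:] that is a power of a Lyndon
        # word followed by a (possibly empty) proper prefix of that word.
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        # The repeated Lyndon word has length j - k; emit every full copy.
        period = j - k
        while i <= k:
            factors.append(s[i:i + period])
            i += period
    return factors
```

For instance, on the string abaabaaabbaabaab used later in Example 7, the sketch yields the non-increasing factors ab, aab and aaabbaabaab.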

Lyndon words proved to be useful for constructing bases in free Lie algebras
[Reu93], constructing de Bruijn sequences [FM78], computing the lexicographically
smallest or largest substring in a string [AC95], and succinct suffix-prefix
matching of highly periodic strings [NS13]. Wider ranging applications include the
Burrows-Wheeler transform and data compression [GS12], musicology [Che04],
bioinformatics [DR04], and in relation to cryptanalysis [Per05]. Indeed the uses,
and hence importance, of Lyndon words are increasing, and so we are motivated
to investigate specialized Lyndon data structures.

The key contributions of this paper are as follows. 

- By combining the important concepts of Lyndon words and borders of strings,
we introduce here the Lyndon Border Array $\mathcal{L}\beta$ of $s$, whose $i$-th entry
$\mathcal{L}\beta(s)[i]$ is the length of the longest border of $s[1..i]$ which is also a Lyndon
word. We present an efficient linear time and space algorithm for computing
the Lyndon Border Array $\mathcal{L}\beta$ for a given string (Section 4).

- In order to achieve the desired level of efficiency in the Lyndon Border Array
construction we also present some interesting results related to Lyndon
combinatorics, which we believe are of independent interest as well (Section 3).

- A complementary data structure, the Lyndon Suffix Array, which is an adaptation
of the classic suffix array, is also defined; by modifying the linear-time
construction of Ko and Aluru [KA03] we similarly achieve a linear construction
for our Lyndon variant (Section 5). We also present a simpler algorithm
to construct a Lyndon Suffix Array from a given Suffix Array (Section 5.1).
The latter algorithm also runs in linear time and space.


2 Basic Definitions and Notation 

Consider a finite totally ordered alphabet $\Sigma$ which consists of a set of characters
(equivalently letters or symbols). The cardinality of the alphabet is denoted by
$|\Sigma|$.

A string (word) is a sequence of zero or more characters over an alphabet
$\Sigma$. A string $s$ of length $|s| = n$ is represented by $s[1..n]$, where $s[i] \in \Sigma$ for
$1 \le i \le n$. The set of all non-empty strings over the alphabet $\Sigma$ is denoted by
$\Sigma^+$, and the set of strings of length $n$ by $\Sigma^n$. The empty string is the empty
sequence of characters (with zero length) denoted by $\varepsilon$, with $\Sigma^* = \Sigma^+ \cup \{\varepsilon\}$; we
write all strings in mathbold: $\mathbf{u}$, $\mathbf{v}$, and so on.

The $i$-th symbol of a string $s$ is denoted by $s[i]$, or simply $s_i$. We denote
by $s[i..j]$, or $s_i \cdots s_j$, the substring of $s$ that starts at position $i$ and ends at
position $j$.

A string $w$ is a substring, or factor, of $s$ if $s = uwv$, where $u, v \in \Sigma^*$;
specifically, a string $w = w_1 \cdots w_m$ is a substring of $s = s_1 \cdots s_n$ if $w_1 \cdots w_m =
s_i \cdots s_{i+m-1}$ for some $i$. Words $w[1..i]$ are called prefixes of $w$, and words
$w[i..n]$ are called suffixes of $w$. The prefix $u$ (respectively suffix $v$) is a proper
prefix (respectively suffix) of a word $w = uv$ if $w \neq u, v$.

For a substring $w$ of $s$, the string $uwv$ for $u, v \in \Sigma^*$ is an extension of $w$
in $s$ if $uwv$ is a substring of $s$; $wv$ for $v \in \Sigma^*$ is a right extension of $w$ in
$s$ if $wv$ is a substring of $s$; $uw$ for $u \in \Sigma^*$ is a left extension of $w$ in $s$ if $uw$
is a substring of $s$. Words that are both prefixes and suffixes of $w$ are called
borders of $w$. By $border(w)$ we denote the length of the longest border of $w$
that is shorter than $w$.

A word $w$ is periodic if it can be expressed as $w = p^k p'$ where $p'$ is a proper
prefix of $p$, and $k \ge 2$. Moreover, a string is said to be primitive if it cannot
be written as $u^k$ with $u \in \Sigma^+$ and $k \ge 2$, i.e., it is not a power of another string.
When $p$ is primitive, we call it "the period" of $w$. It is a known fact [CHL07]
that, for any string $w$, $per(w) + border(w) = |w|$, where the period $per(w)$ of a
nonempty string is the smallest of its periods.
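The identity $per(w) + border(w) = |w|$ can be checked exhaustively on short words. The sketch below is a brute-force illustration only; the helper names `longest_border`, `smallest_period` and `identity_holds` are assumptions introduced here, and both quantities are computed directly from their definitions rather than by an efficient method.

```python
from itertools import product

def longest_border(w):
    """Length of the longest border of w that is shorter than w."""
    for b in range(len(w) - 1, 0, -1):
        if w[:b] == w[len(w) - b:]:
            return b
    return 0

def smallest_period(w):
    """Smallest p >= 1 such that w[i] == w[i + p] whenever both exist."""
    n = len(w)
    for p in range(1, n + 1):
        if all(w[i] == w[i + p] for i in range(n - p)):
            return p
    return 0  # only reached for the empty string

def identity_holds(alphabet, max_len):
    """Check per(w) + border(w) == |w| for all words up to max_len."""
    for n in range(1, max_len + 1):
        for letters in product(alphabet, repeat=n):
            w = "".join(letters)
            if smallest_period(w) + longest_border(w) != n:
                return False
    return True
```

For example, abaab has smallest period 3 (abaab = (aba)(ab)) and longest border ab of length 2, and 3 + 2 = 5.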

Definition 1. (Border array) For a string $s \in \Sigma^n$, the border array $\beta(s)[1..n]$
is defined by $\beta(s)[i] = border(s[1..i])$ for $1 \le i \le n$.

Proposition 2. [MP70] The border of a string $s$ (or the table $\beta(s)$ itself) can
be computed in time $O(|s|)$.
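A minimal sketch of the linear-time computation behind Proposition 2 is given below, in the usual failure-function style. The function name `border_array` is an assumption for illustration, and the returned Python list is 0-based, so entry `i - 1` holds $\beta(s)[i]$.

```python
def border_array(s):
    """Return [beta(s)[1], ..., beta(s)[n]] as a 0-based Python list."""
    n = len(s)
    beta = [0] * (n + 1)      # beta[i] for prefix length i; beta[0] unused
    b = 0                     # length of the current candidate border
    for i in range(2, n + 1):
        # Fall back along shorter borders until one can be extended.
        while b > 0 and s[i - 1] != s[b]:
            b = beta[b]
        if s[i - 1] == s[b]:
            b += 1
        beta[i] = b
    return beta[1:]
```

Each iteration increases `b` by at most one, and every fallback strictly decreases it, which is the standard argument for the $O(|s|)$ bound.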

A string $y = y[1..n]$ is a conjugate (or cyclic rotation) of $x = x[1..n]$ if
$y[1..n] = x[i..n]x[1..i-1]$ for some $1 \le i \le n$ (for $i = 1$, $y = x$). A Lyndon
word is a primitive word which is minimal for the lexicographical order of its 
conjugacy class (i.e., the set of all words obtained by cyclic rotations of letters). 
Furthermore, a non-empty word is a Lyndon word if and only if it is strictly 
smaller in lexicographical order (lexorder) than any of its non-empty proper 
suffixes [Duv83,Lot83]. 
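The two characterizations above (strictly minimal among rotations, and strictly smaller than every non-empty proper suffix) can be written directly as predicates. The sketch below is a quadratic-time illustration, with function names chosen here for convenience; it relies on Python comparing strings lexicographically, with a proper prefix ordered before any of its extensions.

```python
from itertools import product

def is_lyndon_by_rotations(w):
    """w is strictly smaller than each of its non-trivial rotations."""
    return len(w) > 0 and all(w < w[i:] + w[:i] for i in range(1, len(w)))

def is_lyndon_by_suffixes(w):
    """w is strictly smaller than each of its non-empty proper suffixes."""
    return len(w) > 0 and all(w < w[i:] for i in range(1, len(w)))

def characterizations_agree(alphabet, max_len):
    """Compare both predicates on every word up to the given length."""
    for n in range(1, max_len + 1):
        for letters in product(alphabet, repeat=n):
            w = "".join(letters)
            if is_lyndon_by_rotations(w) != is_lyndon_by_suffixes(w):
                return False
    return True
```

A non-primitive word such as abab fails both tests: it equals one of its own rotations, and its suffix ab is a proper prefix of it and hence smaller.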

Throughout this paper, $\mathcal{L}$ will denote the set of Lyndon words over the totally
ordered alphabet $\Sigma$, and $\mathcal{L}_n$ will denote the set of Lyndon words of length $n$; hence
$\mathcal{L} = \mathcal{L}_1 \cup \mathcal{L}_2 \cup \mathcal{L}_3 \cup \cdots$. We next list several well-known properties of Lyndon
words and border arrays which we later apply to develop the new algorithms.

Proposition 3. [Duv83] A word $w \in \Sigma^+$ is a Lyndon word if and only if either
$w \in \Sigma$ or $w = uv$ with $u, v \in \mathcal{L}$, $u < v$.

Theorem 4. [CFL58] Any word $w$ can be written uniquely as a non-increasing
product $w = u_1 u_2 \cdots u_k$ of Lyndon words.

Theorem 4 shows that there is a unique decomposition of any word into
non-increasing Lyndon words ($u_1 \ge u_2 \ge \cdots \ge u_k$).

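Observation 1 Let $\ell$ be a Lyndon word ($\ell \in \mathcal{L}$) where $\ell = \ell_1 \ell_2 \cdots \ell_n$ (to avoid
trivialities we assume $n > 1$), then

(1) $border(\ell) = 0$,

(2) $\ell_1 < \ell_n$,

(3) $\ell_1 \le \ell_i$ for $\ell_i \in \{\ell_2, \ldots, \ell_{n-1}\}$,

(4) $\ell < \ell_i \cdots \ell_n$, for $1 < i \le n$.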

Observation 2 Given a string $s$, then

(1) $\beta(s)[1] = 0$,

(2) if $b$ is a border of $s$, and $b'$ is a border of $b$, then $b'$ is a border of $s$,

(3) $0 \le \beta(s)[i + 1] \le \beta(s)[i] + 1$, for $1 \le i < n$.

We now introduce the Lyndon Border Array and associated computation, 
illustrated in Example 7 below. 

Definition 5. (Lyndon Border Array) For a string $s \in \Sigma^n$, the Lyndon border
array $\mathcal{L}\beta(s)[i]$ is the length of the longest border of $s[1..i]$ which is also a Lyndon
word.

Definition 6. (Lyndon suffix array) For a string $s \in \Sigma^n$, the Lyndon Suffix
Array of $s$ is the lexicographically sorted list of all those suffixes of $s$ that form
Lyndon words.

Given a string s of length n, associated computational problems are: compute 
the Lyndon border and Lyndon suffix arrays; we address these problems in this 
paper. 

Example 7. Consider the string $s$ = abaabaaabbaabaab. The following table
illustrates the border array $\beta$ of $s$, the Lyndon border array $\mathcal{L}\beta$ of $s$, the suffix
array $A$ of $s$ and the Lyndon suffix array $\mathcal{LS}$ of $s$.

i                       1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
s[i]                    a  b  a  a  b  a  a  a  b  b  a  a  b  a  a  b
$\beta(s)[i]$           0  0  1  1  2  3  4  1  2  0  1  1  2  3  4  5
$\mathcal{L}\beta(s)[i]$  0  0  1  1  2  1  1  1  2  0  1  1  2  1  1  2
$A[i]$                  5 13  2 10  6 14  3 11  0  7 15  4 12  1  9  8
$\mathcal{LS}[i]$         5 13 -1 -1 -1 14 -1 -1 -1 -1 15 -1 -1 -1 -1 -1

(The entries of $A$ and $\mathcal{LS}$ are 0-based starting positions of suffixes; $-1$ marks a
suffix that is not a Lyndon word.)
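Definitions 5 and 6 can be evaluated directly on the string of Example 7. The sketch below is a brute-force illustration by definition, not one of the linear-time algorithms of this paper; the function names are assumptions, and suffix start positions are reported 0-based.

```python
def is_lyndon(w):
    """Non-empty and strictly smaller than every proper suffix."""
    return len(w) > 0 and all(w < w[i:] for i in range(1, len(w)))

def lyndon_border_array_by_definition(s):
    """Entry i-1 is the length of the longest Lyndon border of s[1..i]."""
    result = []
    for i in range(1, len(s) + 1):
        prefix, best = s[:i], 0
        for b in range(1, i):                      # proper borders only
            if prefix[:b] == prefix[i - b:] and is_lyndon(prefix[:b]):
                best = b
        result.append(best)
    return result

def lyndon_suffix_array_by_definition(s):
    """0-based start positions of the Lyndon suffixes of s, in lexorder."""
    starts = [i for i in range(len(s)) if is_lyndon(s[i:])]
    return sorted(starts, key=lambda i: s[i:])
```

On abaabaaabbaabaab the Lyndon suffixes are aaabbaabaab, aab, ab and b, starting at 0-based positions 5, 13, 14 and 15.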

3 Lyndon Combinatorics

This section introduces some new interesting combinatorial results on Lyndon
words. In relation to the computation of the Lyndon Border Array, we here show
how to find the shortest prefix of a string that is both border-free and not a
Lyndon word. So assume that for a given string $s$ of length $n$, we have $s[1] = \gamma$.
If $f_1, \ldots, f_q$ are factors of $s$, we use $start(f_i)$ ($end(f_i)$) to denote the index of
$f_i[1]$ ($f_i[|f_i|]$) in $s$, and say that $j$ is an index of $f_i$ if $start(f_i) \le j \le end(f_i)$.
An outline of the steps of the algorithm is as follows:

Algorithm Shortest non-Lyndon Border-free Prefix (SNLBfP).

1. Compute the Lyndon factorization of $s$.

2. Apply binary search to find the first Lyndon factor in the factorization
starting with the largest letter $\mu$ which is strictly less than $\gamma$ (if it exists).

3. Consider the maximal prefix $p$ of $s$ in which every factor $f_1, \ldots, f_q$ starts
with $\gamma$; compute the border array $\beta(p)$ of $p$.

4. Compute $i$, the smallest index of $p$, such that $i > end(f_1)$ ($i$ is not an index
of $f_1$) and $\beta(p)[i] = 0$;

(a) if $i$ does not exist then $i = end(f_q) + 1$ (if it exists)

(b) if $q = 1$ then $i = end(f_1) + 1$ (if it exists).

5. Return $s[1..i]$.
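The steps above can be approximated by a simplified sketch. It is not a line-by-line transcription of Algorithm SNLBfP: it skips the binary search of Step 2 and instead scans the border array of the whole string for the first border-free position past the first Lyndon factor, using the standard fact that the first Lyndon factor is the longest Lyndon prefix. All function names are assumptions made for illustration.

```python
def border_array(s):
    """0-based list; entry i-1 is the longest proper border of s[1..i]."""
    beta, b = [0] * (len(s) + 1), 0
    for i in range(2, len(s) + 1):
        while b > 0 and s[i - 1] != s[b]:
            b = beta[b]
        if s[i - 1] == s[b]:
            b += 1
        beta[i] = b
    return beta[1:]

def first_lyndon_factor_length(s):
    """Length of the first Lyndon factor (first round of Duval's scan)."""
    if not s:
        return 0
    j, k = 1, 0
    while j < len(s) and s[k] <= s[j]:
        k = 0 if s[k] < s[j] else k + 1
        j += 1
    return j - k

def snlbfp_length(s):
    """Length of the shortest border-free non-Lyndon prefix, or None."""
    beta = border_array(s)
    first = first_lyndon_factor_length(s)
    for i in range(first + 1, len(s) + 1):
        if beta[i - 1] == 0:        # border-free and longer than f_1
            return i
    return None
```

For the string of Example 7 the first factor is ab, and the first border-free position after it is 10, so the sketch reports the prefix abaabaaabb.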

Claim 1 Suppose $s[1..i]$ is the shortest prefix of $s$ that is both border-free and
non-Lyndon. Then $i > end(f_1)$, i.e., Algorithm SNLBfP is correct in skipping
the first Lyndon factor.

Proof (Claim 1). The proof is by induction. Assume that the first Lyndon factor
$f_1$ is of length $m$ (with $m > 2$, otherwise the Claim holds trivially). By
Observation 1(1) and Observation 2(1) we have $\beta(f_1)[1] = \beta(f_1)[m] = 0$. The smallest $j$
after 1 where $\beta(f_1)[j] = 0$ must index a Lyndon word $x$ of the form $x = \gamma^{j-1}\nu$
where $\nu > \gamma$. Hence, after the first letter (which is a Lyndon word), the next
border-free prefix is also a Lyndon word.

Assume now that all border-free prefixes up to index $t$ of $f_1$ are Lyndon
words (and hence nested), and suppose that the next border-free position is $t'$.
Then we need to show that $y = f_1[1..t']$ is a Lyndon word; we proceed to
show that $y$ is less than each of its proper suffixes. Let $w_k = f_1[t' - k + 1..t']$
for $1 \le k < t'$. Since $f_1$ is a Lyndon word and minimal in its conjugacy class,
then $f_1[1..k] \le w_k$; further, since $\beta(f_1)[t'] = 0$ we cannot have equality and so
$f_1[1..k] < w_k$ which implies that $y < w_k$ as required.

In other words, we have shown that any border-free prefix of $f_1$ is a Lyndon
word and the result follows.

□ 
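The conclusion of this proof, that every border-free prefix of a Lyndon word is itself a Lyndon word, can be sanity-checked by exhaustive search over short words. The sketch below is only a brute-force experiment under the definitions of Section 2; the function names are assumptions chosen here.

```python
from itertools import product

def is_lyndon(w):
    """Non-empty and strictly smaller than every proper suffix."""
    return len(w) > 0 and all(w < w[i:] for i in range(1, len(w)))

def is_border_free(w):
    """No non-empty proper prefix of w is also a suffix of w."""
    return all(w[:b] != w[len(w) - b:] for b in range(1, len(w)))

def border_free_prefixes_are_lyndon(alphabet, max_len):
    """Check the statement for every Lyndon word up to max_len."""
    for n in range(1, max_len + 1):
        for letters in product(alphabet, repeat=n):
            w = "".join(letters)
            if not is_lyndon(w):
                continue
            for i in range(1, n + 1):
                if is_border_free(w[:i]) and not is_lyndon(w[:i]):
                    return False
    return True
```

For example, the Lyndon word aabab has the border-free prefixes a, aab and aabab, each of which is a Lyndon word, while its prefixes aa and aaba carry the border a.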


Corollary 8. Any border-free prefix of a Lyndon word is a Lyndon word.
Lemma 9. Algorithm SNLBfP is correct.

Proof (Lemma 9). In Step 2 we identify the factor $f_\mu$ which starts with the
letter $\mu < \gamma$. From Lyndon principles, Observation 1(2),(3), it follows that no
factor to the left of $f_\mu$ contains the letter $\mu$. Hence the prefix $s[1..start(f_\mu)]$ is
both border-free and non-Lyndon. However, it may not be the shortest one and
so the algorithm continues to check through the prefix $p$.

Now, consider the index $i$ of $p$ computed in Step 4. Suppose that $i$ is an
index of the factor $f_t$; by Claim 1 we have $t > 1$. Let $k$ be the length of the
prefix $p_t$ of $f_t$ that ends at $i$, and $p_1$ be the prefix of $f_1$ of length $k$. By the
Lyndon factorization we have $p_1 \ge p_t$ (in lexicographic order). If $p_1 = p_t$
then this contradicts $\beta(p)[i] = 0$. Hence $p_1 > p_t$ and so the prefix $s[1..i]$ is
both border-free and not a Lyndon word. Hence Algorithm SNLBfP correctly
returns $s[1..i]$.

□ 


Lemma 10. Algorithm SNLBfP runs in $O(n)$ time.

Proof (Lemma 10). Step 1 can be computed in $O(n)$ time [Duv83,Lot05]. Step 2
applies an $O(\log n)$ binary search. In Step 3 we compute the border array of the
prefix $p$ of $s$. Clearly Steps 3 and 4 can be completed in $O(n)$ time. Hence, the
result follows.

□ 

Lemma 11. (Lyndon invalid point for border-free word) Given a string
$\ell \in \Sigma^n$, $n > 1$, such that $border(\ell) = 0$, if $\ell$ is not a Lyndon word ($\ell \notin \mathcal{L}$), then
all the right extensions $\ell\ell'$ of $\ell$, such that $\ell' \in \Sigma^+$, are not Lyndon words either.

- We refer to this condition as the $\mathcal{L}$-fail condition and to the point (index)
where it occurs as the Lyndon invalid point.

Proof (Lemma 11). Since $\ell$ is border-free ($border(\ell) = 0$) but not a Lyndon
word, let $r$ be the rotation of $\ell$ which is the Lyndon word of the conjugacy
class. Write $\ell = pq$ such that $r = qp$ and $qp < pq$.

Case (1) - If $|p| = |q|$, then since $r$ is border-free and a Lyndon word, $q \neq p$.
Further, since $qp < pq$, we have $q < p$. It follows that $q\ell'p < pq\ell'$ and so
$\ell\ell'$ cannot be a Lyndon word.

Case (2) - If $|q| > |p|$, then $r[1..|p|] < p$ (i.e., $q[1..|p|] < p$), so we have $q < p$.
It follows that $q\ell'p < pq\ell'$ and so $\ell\ell'$ cannot be a Lyndon word.

Case (3) - If $|q| < |p|$. We have $r = qp \in \mathcal{L}$, where $q = q_1 q_2 \cdots q_j$, $p =
p_1 p_2 \cdots p_k$. We are required to show that $r' = pq\ell' \notin \mathcal{L}$; so suppose that
$r' \in \mathcal{L}$. From $r$ we have that $q_1 \le p_1$, while from $r'$ we have $p_1 \le q_1$, which
together implies $q_1 = p_1$. From the rotation of $r$ starting $p_1 p_2$ we have
$q_2 \le p_2$, while from the rotation of $r'$ starting $q_1 q_2$ we find that $p_2 \le q_2$,
giving $q_2 = p_2$. We continue this argument for $|q|$ elements, which shows
that the given word $pq$ has a border.

In the first two cases the order is decided within the first $|p|$ elements.

□ 
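Lemma 11 lends itself to a small exhaustive experiment: take every border-free non-Lyndon word up to some length and confirm that none of its short right extensions is a Lyndon word. The sketch below does exactly that by brute force; the function names and the bounds passed to it are assumptions for illustration, and a passing check is evidence on small cases only, not a proof.

```python
from itertools import product

def is_lyndon(w):
    """Non-empty and strictly smaller than every proper suffix."""
    return len(w) > 0 and all(w < w[i:] for i in range(1, len(w)))

def is_border_free(w):
    """No non-empty proper prefix of w is also a suffix of w."""
    return all(w[:b] != w[len(w) - b:] for b in range(1, len(w)))

def words(alphabet, n):
    """All words of length n over the alphabet."""
    return ("".join(t) for t in product(alphabet, repeat=n))

def lemma11_holds(alphabet, max_len, max_ext):
    """No border-free non-Lyndon word has a Lyndon right extension."""
    for n in range(2, max_len + 1):
        for w in words(alphabet, n):
            if not is_border_free(w) or is_lyndon(w):
                continue
            for m in range(1, max_ext + 1):
                if any(is_lyndon(w + e) for e in words(alphabet, m)):
                    return False
    return True
```

The smallest instance is ba: it is border-free and not a Lyndon word, and every word beginning with ba has a proper suffix that starts with a and is therefore smaller.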

Lemma 12. (Lyndon invalid point for bordered word) Given a string $\ell \in
\Sigma^n$, $n > 1$, such that $border(\ell) > 0$, let $\{\ell'^{1}, \ell'^{2}, \ell'^{3}, \ldots\}$ be right extensions of $\ell$
by $\{1, 2, 3, \ldots\}$ characters in $\Sigma$ respectively. Then $\ell$ and all its right extensions
are not Lyndon words if $\ell'^{1}[|\ell'^{1}|] < \ell'^{1}[border(\ell) + 1]$.

Similarly to Lemma 11, we refer to this condition as the $\mathcal{L}$-fail condition
and to the point (index) where it occurs as the Lyndon invalid point.

Proof (Lemma 12). The lemma follows from the following two cases:

1. $\ell \notin \mathcal{L}$; this is immediate from the hypothesis that $border(\ell) > 0$ and
Observation 1(1).

2. Consider the right extension $\ell'^{m}$ of $\ell$, where $m \ge 1$. Then the suffix
$\ell'^{m}[n - border(\ell) + 1 .. n + m]$ of $\ell'^{m}$ is lexicographically less than
$\ell'^{m}[1 .. border(\ell) + 1]$ and consequently less than $\ell'^{m}$, contradicting the property
that a Lyndon word is strictly smaller than any of its proper suffixes. Hence no right
extension of $\ell$ is a Lyndon word.

In case (2) the order is decided within the first $border(\ell) + 1$ elements.

□ 

Fact 1 For a given string $s \in \Sigma^n$, suppose we have computed $\beta(s)$. Then, for
$1 \le i \le n$, the following holds true:

$\mathcal{L}\beta(s)[i] \in \{\beta(s)[i], \beta(s)[\beta(s)[i]], \beta(s)[\beta(s)[\beta(s)[i]]], \ldots, 0\}$.

A binary Lyndon word can also be expressed in terms of Lyndon properties
of the integer parameters (exponents) given by its Run Length Encoding. For
a binary string $\ell$, let $\mathrm{RLE}(\ell_a)$ denote the encoding (as a string) of the
subsequence of $\ell$ consisting of all letters a but no letter b ($p', p_1, \ldots, p_m$); similarly
$\mathrm{RLE}(\ell_b)$ denotes the encoding of the subsequence of $\ell$ consisting of all letters b
($q', q_1, \ldots, q_m$).

Proposition 13. For a given string $s \in \Sigma^+$, the Run Length Encoding $\mathrm{RLE}(s)$
and subsequently, the list $\mathcal{R}(s)$ can be computed in time and space linear in the
size of the given string $s$.

Now we have the following Proposition to check if a given binary word is a 
Lyndon word or not (using the Run Length Encoding of the binary word). 

Proposition 14. Let $\mathcal{L}$ be the set of Lyndon words over an alphabet $\Sigma$, where
$\Sigma = \{a, b\}$ and $a < b$. For a non-letter word $\ell = \ell_1 \ell_2 \cdots \ell_n$ and its corresponding
exponents list $\mathcal{R}(\ell)$, we have the following:

Case $m = 0$, we have $p_j, q_j = 0$ for $j \in \{1..m\}$, then $\ell \in \mathcal{L}$ if and only if
$p', q' > 0$.

Case $m > 0$, we have $p', q', p_j, q_j > 0$, then

1. $\ell \in \mathcal{L}$ if and only if $p' > p_j$ for $j \in \{1..m\}$.

2. if $p' = p_m$ and $p' > p_j$ for $j \in \{1..m-1\}$ then $\ell \in \mathcal{L}$ if and only if
$q' < q_m \wedge \lambda < 0$.

3. if there exists $p_j$ such that $p' = p_j$ and $p' > p_m$ then $\ell \in \mathcal{L}$ if and only if
$q' < q_j \wedge \lambda < 0$ for $j \in \{1..m-1\}$.

where $\lambda$ is the index of the Lyndon invalid point, i.e. the $\mathcal{L}$-fail condition (as defined
in Lemma 11 and Lemma 12).

Lemma 15. Let $\ell$ be a binary word with $\ell[1] = a$, $\ell[n] = b$ and associated
encodings $\mathrm{RLE}(\ell_a)$ and $\mathrm{RLE}(\ell_b)$. Then $\ell$ is a Lyndon word if and only if either

(i) $\mathrm{RLE}(\ell_a)$ is a Lyndon word on the alphabet $\{1 > 2 > 3 > \cdots\}$; or

(ii) $\mathrm{RLE}(\ell_a)$ is a repetition of a Lyndon word as in (i) and $\mathrm{RLE}(\ell_b)$ is a Lyndon
word on the alphabet $\{1 < 2 < 3 < \cdots\}$.

Proof (Lemma 15). First we consider necessity. So suppose that (i) holds. Consider
any rotation $\ell^r_a$ of $\ell_a$ (including those with split runs of a's). Then $\mathrm{RLE}(\ell_a) <
\mathrm{RLE}(\ell^r_a)$ in lexorder over $\{1 > 2 > 3 > \cdots\}$. Now suppose that (ii) holds. Then for
a rotation $\ell^r_a$ of $\ell_a$, either $\mathrm{RLE}(\ell_a) < \mathrm{RLE}(\ell^r_a)$ in lexorder over $\{1 > 2 > 3 > \cdots\}$,
or $\mathrm{RLE}(\ell_a) = \mathrm{RLE}(\ell^r_a)$ and $\mathrm{RLE}(\ell_b) < \mathrm{RLE}(\ell^r_b)$ in lexorder over $\{1 < 2 < 3 < \cdots\}$.
For sufficiency, the conditions guarantee that $\ell \in \mathcal{L}$.

□ 

Lemma 15 shows that a binary string can be decomposed into its Lyndon factors
with a double application of Duval's linear factorization algorithm [Duv83],
first on the exponents of a and then on those for b - hence linear in $|\mathcal{R}(\ell)|$ (once
$\mathcal{R}(\ell)$ is constructed in time linear in $|\ell|$). It follows that a word can be tested to
be a Lyndon word in linear time: check whether there is more than one factor
in the factorization.

Lemma 16. Suppose we are given a binary string $s$ of length $n$ and the
corresponding list of exponents $\mathcal{R}(s)$. Then we can compute the array $\Psi(s)$ in $O(n)$
time.

Proof (Lemma 16). We first focus on computing the $i$-th element $\Psi[i](s)$ of the
array $\Psi(s)$. Clearly, by Proposition 14, it takes a constant number of comparisons
to determine whether the prefix $s[1..i]$ is a Lyndon word or not. Since we have
to repeat the steps in Proposition 14 for $n$ prefixes, $\Psi(s)$ can be computed
in linear time once $\mathcal{R}(s)$ is computed.

□ 


4 Lyndon Border Array Computation 

In this section we develop an efficient algorithm for computing the Lyndon Border
Array. We first recall an interesting relation that exists for borders which we
refer to as the Chain of Borders henceforth. Since every border of any border
of $s$ is also a border of $s$ (Observation 2(2)), it turns out that the border array
$\beta(s)$ compactly describes all the borders of every prefix of $s$. For every prefix
$s[1..i]$ of $s$, the following sequence

$\beta(s)^1[i], \beta(s)^2[i], \ldots, \beta(s)^m[i]$    (1)

is well defined and monotonically decreasing to $\beta(s)^m[i] = 0$ for some $m \ge 1$
and this sequence identifies every border of $s[1..i]$. Here, $\beta(s)^k[i]$ is the length
of the $k$-th longest border of $s[1..i]$, for $1 \le k \le m$. Sequence (1) identifies
the above-mentioned chain of borders. We will also be using the usual notion of
the length of the chain of borders. Clearly, the length of the chain in Sequence
(1) is $m$. Also we use the following notion and notations. In Sequence (1), we
call $\beta(s)^m[i] = 0$ the last value and $\beta(s)^{m-1}[i]$ the penultimate value in the
chain. We now present the following interesting facts that will be useful in our
algorithm.

Fact 2 The length of the chain of borders for a Lyndon Border is at most 2. 

Proof. The result follows, because, if a border is a Lyndon Word, then it cannot 
itself have a border (Observation 1(1)). 

□ 


Fact 3 Suppose $\mathcal{L}\beta(s)[i] = x$. Then $\mathcal{L}\beta(s)[x] = 0$.

Proof. Immediate from Fact 2. 

□ 

Now we are ready to propose a straightforward naive algorithm to compute the
Lyndon Border Array $\mathcal{L}\beta(s)$ for $s[1..n]$ as follows.

Algorithm Naive Lyndon Border Array Construction
1: Compute $\beta(s)[1..n]$
2: for $i = n \to 1$ do
3:   Find the penultimate value $k$ of the chain of borders for $\beta(s)[i]$
4:   if $s[1..k]$ is a Lyndon word then
5:     Set $\mathcal{L}\beta(s)[i] = k$
6:   else
7:     Set $\mathcal{L}\beta(s)[i] = 0$
8:   end if
9: end for
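The naive procedure can be sketched in Python as below. The sketch treats the penultimate value of the chain as the shortest non-empty border of $s[1..i]$ (taken to be 0 when $\beta(s)[i] = 0$), and tests the Lyndon property by direct suffix comparison; the function names and these conventions are assumptions made for illustration, and the running time is not linear.

```python
def border_array(s):
    """0-based list; entry i-1 is the longest proper border of s[1..i]."""
    beta, b = [0] * (len(s) + 1), 0
    for i in range(2, len(s) + 1):
        while b > 0 and s[i - 1] != s[b]:
            b = beta[b]
        if s[i - 1] == s[b]:
            b += 1
        beta[i] = b
    return beta[1:]

def is_lyndon(w):
    """Non-empty and strictly smaller than every proper suffix."""
    return len(w) > 0 and all(w < w[i:] for i in range(1, len(w)))

def naive_lyndon_border_array(s):
    """Entry i-1 is the longest Lyndon border of s[1..i] (0 if none)."""
    n = len(s)
    beta = [0] + border_array(s)            # shift to 1-based indexing
    result = [0] * (n + 1)
    for i in range(n, 0, -1):
        k, b = 0, beta[i]
        while b > 0:                        # walk the chain of borders
            k, b = b, beta[b]               # k ends as the penultimate value
        result[i] = k if k > 0 and is_lyndon(s[:k]) else 0
    return result[1:]
```

On the string of Example 7, position 16 has the chain 5, 2, 0; the penultimate value 2 corresponds to the Lyndon word ab, so the entry is 2.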

The correctness of the algorithm follows directly from Facts 2 and 3. Now we
discuss an efficient implementation of the algorithm. When we traverse through
the chain of borders, we reach the penultimate value $k$ and then the last value 0.
Clearly, all we need is to check for each such chain, whether $s[1..k]$ is a Lyndon
word. So, we are always interested in finding whether $s[1..k]$ is a Lyndon word
where $\beta(s)[k] = 0$. At this point the computation of SNLBfP (Section 3) will
be applied. To give an example, by Corollary 8, we know that any border-free
prefix to the left of SNLBfP is a Lyndon word and any border-free prefix to the
right of SNLBfP is non-Lyndon. This gives us an efficient weapon to check the
If statement of Line 4 of the above algorithm. Finally, as can be seen below,
we can make use of a stack data structure along with some auxiliary arrays to
efficiently implement the above algorithm. In particular, we simply keep an array
Done[$1..n$] initially all false, using a stack and the SNLBfP index (say $r$). The
algorithm is presented below:

Algorithm Efficient Lyndon Border Array Construction.
1: for $i = 1 \to n$ do
2:   Set Done[$i$] = FALSE
3: end for
4: Compute $\beta(s)[1..n]$
5: Compute SNLBfP using Algorithm SNLBfP. Say, $s[1..r]$ is the SNLBfP.
6: for $i = n \to 1$ do
7:   if Done[$i$] = TRUE then
8:     continue    ▷ i.e., skip what is done below
9:   end if
10:  Compute the chain of $\beta(s)[i]$ and push each onto a stack $S$.
11:  Compute the penultimate value $k$ of the chain of $\beta(s)[i]$.
12:  while $S$ is nonempty do
13:    Pop the value $j$ from the stack
14:    if $k < r$ then    ▷ i.e., $s[1..k]$ is a Lyndon word
15:      $\mathcal{L}\beta(s)[j] = k$
16:    else
17:      $\mathcal{L}\beta(s)[j] = 0$
18:    end if
19:    Done[$j$] = TRUE
20:  end while
21: end for


Clearly, the time complexity of the above algorithm depends on how many
times a chain of borders is traversed. If we can ensure that a chain of borders
is never traversed more than once, then the algorithm will surely be linear.
To achieve that we use another array Vval[$1..n$], initially all set to $-1$. This
is required to efficiently compute the penultimate value of a chain of borders.
The difficulty here arises because we may need to traverse a part of a chain of
borders more than once through different indices because two different indices
of the border array, $\beta(s)$, may have the same value. This may incur more cost
and make the algorithm super-linear. To avoid traversing any part of a chain of
borders more than once we use the array Vval[$1..n$] as follows. Clearly, we only
need to traverse the chain of borders to compute the penultimate value. So, as
soon as we have computed the penultimate value $k$ for a chain, we store the
value in the corresponding indices of Vval. To give an example, suppose we are
considering the chain of borders $\beta(s)^1[i], \beta(s)^2[i], \ldots, \beta(s)^{m-1}[i], \beta(s)^m[i]$, where
$\beta(s)^m[i] = 0$, and suppose that the penultimate value is $k$, i.e., $\beta(s)^{m-1}[i] = k$.
Now suppose further that the indices involved in the above chain of borders are
$i = i_1, i_2, \ldots, i_{m-1}$. Then as soon as we have got the penultimate value, we
update Vval[$i_1$] = Vval[$i_2$] = $\cdots$ = Vval[$i_{m-1}$] = $k$. How does this help? If the
same chain or part thereof is reached for computing the penultimate value, we
can easily return the value from the corresponding Vval[$1..n$] entry, and thus
we never need to traverse a chain or part thereof more than once. This ensures
the linear running time of the algorithm.
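The memoisation idea can be illustrated with a left-to-right variant, which is a swapped-in simplification rather than the stack-based procedure given above. Here the list `pval` plays the role of Vval and is filled by a one-step recurrence; the threshold `r` is the SNLBfP length, taken to be the first border-free position past the first Lyndon factor (or $n + 1$ if there is none). All names are assumptions made for illustration.

```python
def border_array(s):
    """0-based list; entry i-1 is the longest proper border of s[1..i]."""
    beta, b = [0] * (len(s) + 1), 0
    for i in range(2, len(s) + 1):
        while b > 0 and s[i - 1] != s[b]:
            b = beta[b]
        if s[i - 1] == s[b]:
            b += 1
        beta[i] = b
    return beta[1:]

def first_lyndon_factor_length(s):
    """Length of the first Lyndon factor (first round of Duval's scan)."""
    if not s:
        return 0
    j, k = 1, 0
    while j < len(s) and s[k] <= s[j]:
        k = 0 if s[k] < s[j] else k + 1
        j += 1
    return j - k

def linear_lyndon_border_array(s):
    """Entry i-1 is the longest Lyndon border of s[1..i] (0 if none)."""
    n = len(s)
    beta = [0] + border_array(s)                   # 1-based indexing
    first = first_lyndon_factor_length(s)
    # r: length of the shortest border-free non-Lyndon prefix, else n + 1.
    r = next((i for i in range(first + 1, n + 1) if beta[i] == 0), n + 1)
    pval = [0] * (n + 1)                           # penultimate chain values
    result = [0] * (n + 1)
    for i in range(1, n + 1):
        b = beta[i]
        if b > 0:
            # Reuse the value already stored for the shorter prefix.
            pval[i] = pval[b] if pval[b] > 0 else b
        k = pval[i]
        result[i] = k if 0 < k < r else 0          # k < r means Lyndon
    return result[1:]
```

Each index is visited once and consults a single earlier entry, so no part of a chain is walked twice.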

4.1 For Binary Alphabet

We consider here the case of a binary alphabet $\Sigma$, where $\Sigma = \{a, b\}$ and $a < b$.
Then a binary string $s$ of length $|s| = n$, where $s$ is a non-letter string (i.e.,
$n > 1$), can be expressed as:

$s[1..n] = (a)^{p'} (b)^{q'} (a)^{p_1} (b)^{q_1} \cdots (a)^{p_m} (b)^{q_m}$    (2)

for $0 \le p', p_1, \ldots, p_m, q', q_1, \ldots, q_m$. Considering our interest in non-empty binary
Lyndon strings, which are border-free, we require the stronger condition that
$0 < p', q' < n$ and $0 < p_j, q_j$. Let $\mathcal{L}$ be the set of binary Lyndon words.

Clearly, Equation (2) represents the Run Length Encoding $\mathrm{RLE}(s)$ of a binary
string $s$. For example, if $s = (a)(b)(aa)(b)(aaa)(bb)(aa)(b)(aa)(b)$, then

$\mathrm{RLE}(s) = (a)^1 (b)^1 (a)^2 (b)^1 (a)^3 (b)^2 (a)^2 (b)^1 (a)^2 (b)^1$

We are interested in the values of the parameters (exponents) in Equation
(2) ($p', p_1, \ldots, p_m, q', q_1, \ldots, q_m$) which we will maintain in an auxiliary linked list
$\mathcal{R}(s)$. Note that $\mathcal{R}[0] = p'$ and $\mathcal{R}[1] = q'$. So for the example string $s$ above,
$\mathcal{R}(s) = \{1,1,2,1,3,2,2,1,2,1\}$.

Proposition 17. For a given string $s \in \Sigma^+$, the Run Length Encoding $\mathrm{RLE}(s)$
and subsequently, the list $\mathcal{R}(s)$ can be computed in time and space linear in the
size of the given string $s$.
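A minimal sketch of the single linear pass behind Proposition 17 follows. The function name `exponent_list` is an assumption, as is the convention of padding with a zero exponent when the string starts with b or ends with a, so that the list always alternates a-run, b-run as in Equation (2); a Python list stands in for the linked list.

```python
from itertools import groupby

def exponent_list(s):
    """Run lengths of a binary string over {a, b}: [p', q', p_1, q_1, ...]."""
    runs = [(letter, len(list(group))) for letter, group in groupby(s)]
    if runs and runs[0][0] == "b":
        runs.insert(0, ("a", 0))        # p' = 0 when s starts with b
    exponents = [length for _, length in runs]
    if len(exponents) % 2 == 1:
        exponents.append(0)             # last b-exponent is 0 when s ends with a
    return exponents
```

For the example string above the sketch returns the list 1, 1, 2, 1, 3, 2, 2, 1, 2, 1, whose entries sum to the length 16 of the string.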

Now we have the following Proposition to check if a given binary word is a 
Lyndon word or not (using the Run Length Encoding of the binary word). 

Proposition 18. Let $\mathcal{L}$ be the set of Lyndon words over an alphabet $\Sigma$, where
$\Sigma = \{a, b\}$ and $a < b$. For a non-letter word $\ell = \ell_1 \ell_2 \cdots \ell_n$ and its corresponding
exponents list $\mathcal{R}(\ell)$, we have the following:

Case $m = 0$, we have $p_j, q_j = 0$ for $j \in \{1..m\}$, then $\ell \in \mathcal{L}$ if and only if
$p', q' > 0$.

Case $m > 0$, we have $p', q', p_j, q_j > 0$, then

1. $\ell \in \mathcal{L}$ if and only if $p' > p_j$ for $j \in \{1..m\}$.

2. if $p' = p_m$ and $p' > p_j$ for $j \in \{1..m-1\}$ then $\ell \in \mathcal{L}$ if and only if
$q' < q_m \wedge \lambda < 0$.

3. if there exists $p_j$ such that $p' = p_j$ and $p' > p_m$ then $\ell \in \mathcal{L}$ if and only if
$q' < q_j \wedge \lambda < 0$ for $j \in \{1..m-1\}$.

where $\lambda$ is the index of the Lyndon invalid point, i.e. the $\mathcal{L}$-fail condition (as defined
in Lemma 11 and Lemma 12).

Lemma 19. For a given string $s$ of length $n = |s|$, let the list $\mathcal{R}$ be the list of
exponents of the Run Length Encoding of $s$. Using the list $\mathcal{R}$ it can be checked for
each proper prefix of $s$ whether it is a Lyndon word or not (computing the array $\Psi(s)$)
in $O(n)$ time.

Proof. (Lemma 19) We first focus on computing the $i$-th element $\Psi[i](s)$ of the
array $\Psi(s)$.

Clearly, by Proposition 18, it takes a constant number of comparisons to
determine whether the prefix $s[1..i]$ is a Lyndon word or not.

Since we have to repeat the steps in Proposition 18 for $n$ prefixes,
$\Psi(s)$ can be computed in linear time once $\mathcal{R}(s)$ is computed.

□ 

While computing the border array $\beta(s)$ (using the Algorithm in [Lot05]),
we can check conditions (1), (2) & (3) in Observation 1 in constant time, by
checking if $\ell_1 < \ell_n$ and $\beta[|\ell|] \le 0$.

We can check condition (4) (Observation 1) by computing the parameters
according to Proposition 18 and checking the $\mathcal{L}$-fail conditions (Lemma 11 &
12).

So, we will evaluate the parameters ($p', p_1, \ldots, p_m, q', q_1, \ldots, q_m$) while
computing the lists $\beta[i]$, $\mathcal{R}$ and $\mathcal{L}\beta[i]$ simultaneously.

Note that the $i$-th entry of the Lyndon Border Array $\mathcal{L}\beta[i]$ corresponds to one
of the previous borders (that is also a Lyndon word) and hence we need to
consult the value $\Psi[\beta[i]]$; we will set $\mathcal{L}\beta[i] = \beta[i]$ if $\Psi[\beta[i]]$ = true. Otherwise,
the algorithm recursively finds the longest border in the chain of borders (at
position $i$) that is also a Lyndon word.

An outline of the algorithm is as follows:

1. Compute the border array $\beta(s)$.

2. Compute the Run Length Encoding and subsequently the list $\mathcal{R}(s)$.

3. Compute $\Psi(s)$ (Proposition 18).

4. Check for each prefix of the input string whether or not there exists a border
of length $b \neq 0$ such that $s[1..b] \in \mathcal{L}$.

A binary Lyndon word can also be expressed in terms of Lyndon properties
of the integer parameters (exponents) given by its Run Length Encoding. For a
binary string $\ell$, let $\mathcal{R}(\ell_a)$ denote the encoding (as a string) of the subsequence of
$\ell$ consisting of all letters a but no letter b ($p', p_1, \ldots, p_m$); similarly $\mathcal{R}(\ell_b)$ denotes
the encoding of the subsequence of $\ell$ consisting of all letters b ($q', q_1, \ldots, q_m$).

Lemma 20. Let $\ell$ be a binary word with $\ell[1] = a$, $\ell[n] = b$ and associated
encodings $\mathcal{R}(\ell_a)$ and $\mathcal{R}(\ell_b)$. Then $\ell$ is a Lyndon word if and only if either

(i) $\mathcal{R}(\ell_a)$ is a Lyndon word on the alphabet $\{1 > 2 > 3 > \cdots\}$, or

(ii) $\mathcal{R}(\ell_a)$ is a repetition of a Lyndon word as in (i) and $\mathcal{R}(\ell_b)$ is a Lyndon
word on the alphabet $\{1 < 2 < 3 < \cdots\}$.

Proof. (Lemma 20) First we consider necessity. So suppose that (i) holds.
Consider any rotation $\ell^r_a$ of $\ell_a$ (including those with split runs of a's). Then
$\mathcal{R}(\ell_a) < \mathcal{R}(\ell^r_a)$ in lexorder over $\{1 > 2 > 3 > \cdots\}$. Now suppose that (ii)
holds. Then for a rotation $\ell^r_a$ of $\ell_a$, either $\mathcal{R}(\ell_a) < \mathcal{R}(\ell^r_a)$ in lexorder over
$\{1 > 2 > 3 > \cdots\}$, or $\mathcal{R}(\ell_a) = \mathcal{R}(\ell^r_a)$ and $\mathcal{R}(\ell_b) < \mathcal{R}(\ell^r_b)$ in lexorder over
$\{1 < 2 < 3 < \cdots\}$.

For sufficiency, the conditions guarantee that $\ell \in \mathcal{L}$.

□ 
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Run time analysis (For Binary Alphabet) Computing /3(s) can be done 
in linear time Proposition. 2. Also, run length encoding (hence the list 1Z (the 
exponents p' ,pi\ ■ .; p m , s', Qii ■ ■ ; Qm)) Proposition. 17 and the array >L( s ) can be 
done in linear time according to Proposition. 18 and Lemma. 19 respectively. 

Consequently Lyndon Border Array can be computed in linear time once 
/3(s), IZ(s) <L(s) is computed. 

The algorithm requires extra space for keeping the arrays >L(s) and f3(s) 
(each of length n). Additionally, the algorithm uses an auxiliary extra space to 
maintain the list 77(s) of length at most n. Therefore the total space required 
to run the procedure is linear to the length of the given string. 

Similar time/space efficiency can be achieved to compute the Lyndon Border
Array by simply applying Duval's algorithm [Duv83] on the exponents of a's
and then if necessary on those for b's (the concept presented in Lemma 20).

Lemma 20 shows that a binary string can be decomposed into its Lyndon 
factors with a double application of Duval’s linear algorithm [Duv83], first on 
the exponents of a and then if necessary on those for b. 
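Duval's algorithm itself is short. The Python sketch below is the standard formulation for an ordinary string under the usual letter order; it does not implement the double application to the exponent sequences, which would additionally need the reversed integer order for the exponents of a. The function name `duval` is an assumption of this example.

```python
def duval(s: str) -> list[str]:
    """Lyndon factorization of s: Lyndon words in non-increasing order."""
    n = len(s)
    factors = []
    i = 0
    while i < n:
        j, k = i + 1, i
        # Extend the current run of repeated Lyndon words as far as possible.
        while j < n and s[k] <= s[j]:
            if s[k] < s[j]:
                k = i       # the run collapses into one longer Lyndon word
            else:
                k += 1      # the run continues to repeat
            j += 1
        # Emit every complete copy of the Lyndon word of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


print(duval("banana"))  # ['b', 'an', 'an', 'a']
```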

5 Lyndon Suffix Array Computation 

The well-known suffix array of a string records the lexicographically sorted list of
all of its suffixes. Our next contribution is to show how the Lyndon Suffix Array,
like the original suffix array, can be constructed in linear time; for a string of
length n it follows that the indexes in the Lyndon variant will be a subset of
{1, 2, ..., n}. We will exploit the elegant fact that Lyndon suffixes are nested:

Fact 4. If the given string s is a Lyndon word, then by the properties of Lyndon
suffixes, the indexes in the Lyndon Suffix Array will necessarily be increasing.
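A brute-force check can make the nesting concrete. The Python sketch below lists the Lyndon suffixes of a string in lexicographic order and reports their 1-based starting positions; it runs in quadratic time or worse and is meant only as an illustration. The names `is_lyndon` and `lyndon_suffix_array_naive` are assumptions of this example.

```python
def is_lyndon(w: str) -> bool:
    """A nonempty word is Lyndon iff it is strictly smaller than each proper suffix."""
    return len(w) > 0 and all(w < w[i:] for i in range(1, len(w)))


def lyndon_suffix_array_naive(s: str) -> list[int]:
    """1-based starts of the Lyndon suffixes of s, sorted lexicographically."""
    starts = [i for i in range(len(s)) if is_lyndon(s[i:])]
    starts.sort(key=lambda i: s[i:])
    return [i + 1 for i in starts]


print(lyndon_suffix_array_naive("aabab"))  # [1, 4, 5]
```

For the Lyndon word `aabab` the Lyndon suffixes are `aabab`, `ab` and `b`, and their starting positions appear in increasing order, in line with Fact 4.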

In order to efficiently construct the Lyndon suffix array we could directly
modify the linear-time and space-efficient method of Ko and Aluru [KA03] given
for the original data structure; this would involve lex-extension ordering
(lexorder for substrings, see [DS14]) along with Fact 4. We note that the Ko-Aluru
method has also recently been adapted to non-lexicographic V-order and
V-letters, and applied in a novel Burrows-Wheeler transform [DS14], and hence is
quite a versatile technique. Since the modification here will be similar, we will
just outline the main steps.

Let an L-letter $\ell = \ell_1 \ell_2 \cdots \ell_m$ substring denote the simple case of a Lyndon
word such that $\ell_1 < \ell_i$ for $2 \le i \le m$, assumed to be of maximal length, that is,
$\ell_1 = \ell_{m+1}$ (if $\ell_{m+1}$ exists); hence $|\ell| \ge 1$.
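One way to read this definition is that the L-letters are the chunks obtained by cutting the string at every occurrence of its minimal letter. The Python sketch below follows that reading; the function name `l_letters` and the decision to return the leading part before the first minimal letter separately are assumptions of this example, and unit-length L-letters are not concatenated here.

```python
def l_letters(s: str) -> tuple[str, list[str]]:
    """Split s into a prefix without the minimal letter, followed by L-letters."""
    if not s:
        return "", []
    smallest = min(s)
    # Every occurrence of the minimal letter starts a new L-letter.
    starts = [i for i, c in enumerate(s) if c == smallest]
    prefix = s[:starts[0]]
    chunks = [s[a:b] for a, b in zip(starts, starts[1:] + [len(s)])]
    return prefix, chunks


print(l_letters("cabaabacab"))  # ('c', ['ab', 'a', 'ab', 'ac', 'ab'])
```

Each chunk starts with the minimal letter and contains it nowhere else, so its first letter is strictly smaller than all of its other letters.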

Since, apart from the last letter, there may not be Lyndon suffixes of a string, 
we perform a linear scan to record the locations of the minimal letter $\ell_1$, say, in
the string s. Observe also that either an L-letter is a Lyndon suffix, or it is the 
prefix of a Lyndon suffix - the point is that an L-letter is a well-defined chunk of 
text, a substring of the input s, as opposed to the classic single letter approach. 
In order to sort chunks of text lexicographically, we will apply lex-extension order 
defined as follows. 
Definition 21. Suppose that according to some factorization $\mathcal{F}$ two strings
$u, v \in \Sigma^+$ are expressed in terms of nonempty factors: $u = u_1 u_2 \cdots u_m$,
$v = v_1 v_2 \cdots v_n$. Then $u <_{lex(\mathcal{F})} v$ if and only if one of the following holds:

(1) u is a proper prefix of v (that is, $u_i = v_i$ for $1 \le i \le m < n$); or

(2) for some $i \in 1..\min(m, n)$, $u_j = v_j$ for $j = 1, 2, \ldots, i - 1$, and $u_i < v_i$ (in
lexicographic order).
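Definition 21 translates directly into a comparison of two lists of factors. In the Python sketch below each string is assumed to be given already factorized as a list of nonempty substrings; the function name `lex_extension_less` is an assumption of this example.

```python
def lex_extension_less(u_factors: list[str], v_factors: list[str]) -> bool:
    """True iff u precedes v in lex-extension order for the given factorizations."""
    for u_i, v_i in zip(u_factors, v_factors):
        if u_i != v_i:
            # Case (2): the first differing pair of factors decides.
            return u_i < v_i
    # Case (1): all shared factors agree, so u must be a proper prefix of v.
    return len(u_factors) < len(v_factors)


print(lex_extension_less(["ab", "a"], ["ab", "a", "ab"]))  # True
print(lex_extension_less(["ab", "ac"], ["ab", "ab", "z"]))  # False
```

Python's built-in comparison of lists of strings behaves the same way, so `u_factors < v_factors` would give the same answer; the explicit loop is kept only to mirror the two cases of the definition.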

First, using a linear scan we apply the $\mathcal{L}\beta$ to record the indexes of all the
Lyndon suffixes of s. The factorization $\mathcal{F}$ which we will use is that of decomposing
the input into substrings of L-letters; in the case of unit length L-letters we
concatenate them so that $\mathcal{F}$ has the general form:

where $u \in \Sigma^*$ does not contain $\ell_1$, $t \ge 1$, each $i_j \ge 0$, every $\ell_j$ is an L-letter
and $|\ell_t| \ge 1$. In practice this just entails keeping track of occurrences of the
letters $\ell_1$ in s. Since we are computing Lyndon suffixes we can ignore u and
hence assume that $\mathcal{F}$ has the form $\ell_1 \ell_1^{i_1} \ell_2 \cdots \ell_1^{i_{t-1}} \ell_t$.

An $\ell$-suffix is a suffix of $\mathcal{F}$ commencing with an L-letter. We now apply the Ko-Aluru
linear method, consisting of three main steps, to the $\ell$-suffixes; we first
outline the steps, followed by further detail.

- Using a linear scan of the input string s and lex-extension ordering, divide
all $\ell$-suffixes of s into two types. That is, let $s_i$ denote the suffix starting
at index i, so $s_i = s[i..n]$. Then the type S Lyndon suffixes are the set
$\{s_i \mid s_i < s_{i+1}\}$ and the type L Lyndon suffixes are the set $\{s_i \mid s_i > s_{i+1}\}$.

- Sort by lex-extension all $\ell$-suffixes of type S in O(n) time using a modified
Bucket Sort followed by recursion on at most half of the string.

- Using a linear scan obtain the lex-extension order of all remaining $\ell$-suffixes
(assumed to be type L) from the sorted ones. This step is obtained from
observing that the type L $\ell$-suffixes occurring in s between two type S
$\ell$-suffixes, $s_i$ and $s_j$ where i < j, are already ordered such that
$L_l <_{lex(\mathcal{F})} L_k$ for i < k < l < j.
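The type classification in the first step can be done with a single right-to-left scan. The Python sketch below shows the classic single-letter version used by Ko and Aluru, not the lex-extension variant over L-letters outlined above; the function name `suffix_types` is an assumption, and as a simplification the last suffix is labelled L here (as if compared with the empty suffix).

```python
def suffix_types(s: str) -> str:
    """Type of each suffix: 'S' if s[i:] < s[i+1:], 'L' if s[i:] > s[i+1:]."""
    n = len(s)
    if n == 0:
        return ""
    types = ["L"] * n  # simplification: the last suffix is labelled L
    for i in range(n - 2, -1, -1):
        if s[i] < s[i + 1]:
            types[i] = "S"
        elif s[i] > s[i + 1]:
            types[i] = "L"
        else:
            # Equal first letters: the type is inherited from the next suffix.
            types[i] = types[i + 1]
    return "".join(types)


print(suffix_types("mississippi"))  # LSLLSLLSLLL
```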

At this stage we have computed an $\ell$-suffix array. There are two final steps
for computing the Lyndon suffix array. Firstly, the classic suffix array for the
last L-letter $\ell_t$ (without the prefix $\ell_1$) is processed directly using the Ko-Aluru
method and the indexes inserted into the $\ell$-suffix array. Then we perform a linear
scan of the $\ell$-suffix array and, applying Fact 4, select only the sequence of
increasing integers, which yields the Lyndon suffix array.

The last suffix is both type S and L. The modification from lexicographic to
lex-extension order follows, firstly, from the linear factorization of the input into
L-letters, and secondly from the fact that lex-extension ordering applies
lexicographic order pairwise to L-letter substrings, each comparison requiring no
more than time linear in the length of the L-letters - hence O(n) overall. The
space efficiency follows from the original method [KA03].
5.1 A Simpler algorithm for computing a Lyndon Suffix Array from
a Suffix Array


We present an alternative simple algorithm, derived from the classic suffix array,
which also exploits the nested structure expressed in Fact 4. Suppose we are
given the suffix array of the string s. Now our algorithm finds the largest suffix
(max), and then searches inside it to find the second largest one and so on,
taking advantage of the fact that the suffixes are already sorted in the suffix
array. Repeatedly finding the max value can be implemented efficiently using
the Range Minimum Query (RMQ) [BFC00], which requires O(n) time
preprocessing and then O(1) time for each query.


Lemma 22. Maximum range suffixes are Lyndon words: Suppose we are
given the suffix array A of a string s. Let the set $\mathcal{M}$ be the set of maximum values
of the range of suffixes A[0..i], where i is the index of the i-th suffix $\ell_i$. Each
suffix $\ell \in \mathcal{M}$ is a Lyndon word.


Proof (Lemma 22). Suppose $\ell$ is a max range suffix ($\ell \in \mathcal{M}$) at index i with
order value A[i] = m. Suppose there exists a suffix $\ell''$ of $\ell$ at index i'' (w.r.t.
A) with order value A[i''] = m'', and also suppose $\ell'' < \ell$. Hence m'' > m
(A[i''] > A[i]) where i'' < i, which contradicts the fact that m is the maximum
value in the range A[0..i]. Therefore, $\ell$ is strictly smaller than all of its proper
suffixes - we conclude that $\ell$ is a Lyndon word.

□ 


The linear-time method for computing the Lyndon suffix array is very simple. 
Below, we first outline the steps followed by the pseudo-code. 


1. Compute the suffix array A of s$.

2. Find the value max = MAX(A[0..n]) and its index i in the suffix array,
then add the value max to the Lyndon Suffix Array $\mathcal{LS}$.

3. Find the value max and its index in the range A[0..i-1] to the left of the
previous maximum, max = MAX(A[0..i-1]) and i = indexof(max, A), then add the
value max to $\mathcal{LS}$.

4. Repeat step 3 until the range A[0..i-1] is empty.
Algorithm 1 Computing the Lyndon Suffix Array from a Suffix Array
procedure ComputeLSA(s)
    n ← |s|; L[1..n] ← (-1)^n
    ▷ compute the suffix array of s
    A ← SA(s$)
    i ← n; j ← n; max ← 0
    while i > 0 do    ▷ the range A[0..i-1] is not empty
        ▷ find the maximum value (and its index) in the range A[0..i-1]
        (max, i) ← (A[idx], idx) where idx = argmax A[idx] for 0 ≤ idx < i
        j ← j - 1
        L[j] ← max
    end while
    return L
end procedure
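The idea behind Algorithm 1 can also be sketched in a few lines of Python. This version makes two simplifying assumptions that are not part of the method above: the suffix array is built by naive sorting without a sentinel (so it is not linear time), and the repeated range-maximum queries are replaced by one left-to-right scan that keeps every position larger than all positions before it in the suffix array. The function names are hypothetical.

```python
def suffix_array(s: str) -> list[int]:
    """Naive suffix array of 0-based start positions; for illustration only."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def lyndon_suffix_array(s: str) -> list[int]:
    """0-based starts of the Lyndon suffixes of s, in lexicographic order."""
    result = []
    best = -1  # largest start position seen so far in suffix-array order
    for start in suffix_array(s):
        # A suffix is Lyndon iff every lexicographically smaller suffix
        # starts to its left, i.e. its start is a running maximum.
        if start > best:
            best = start
            result.append(start)
    return result


print(lyndon_suffix_array("abaab"))  # [2, 3, 4]
```

The running-maximum scan visits the same entries that the repeated maximum search selects, but from left to right, so the output is already in increasing order.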


6 Conclusion 

In this article, we have extended two well-known data structures in stringology.
We first adapted the concept of a border array to introduce the Lyndon
Border Array $\mathcal{L}\beta$ of a string s, and described a linear-time and linear-space
algorithm for computing $\mathcal{L}\beta(s)$. Furthermore, we have defined the Lyndon
Suffix Array, which is an adaptation of the classic suffix array. By modifying
the linear-time construction of Ko and Aluru [KA03] we similarly achieve a
linear construction for our Lyndon variant. We also present a simpler algorithm
to construct a Lyndon Suffix Array from a given Suffix Array.

The potential value of the Lyndon Border Array is that it allows for deeper
burrowing into a string to yield paired Lyndon patterned substrings. The Lyndon
suffix array lends itself naturally to searching for Lyndon patterns in a string.
If the given text or string has a sparse number of Lyndon words (as is likely in
English literature due to the vowels a, e often occurring internally in words), then
the Lyndon suffix array may offer efficiencies. Polyrhythms, or cross-rhythms,
arise when two or more independent rhythms play at the same time; nested
Lyndon suffixes can exist in these rhythms. We propose that applications of
these specialized data structures might arise in the context of the relationship
existing between de Bruijn sequences and Lyndon words [FM78].

We propose that applications of the Lyndon Border Array may arise in
combinatorics in relation to the Christoffel words. The methods used for these
problems often make use of structures equivalent to suffix trees in order to achieve
efficient execution.